Abstract

In this paper we present an approach for detecting simple objects
in RGB-D data. Object detection in cluttered indoor environments is
an important perceptual capability of robotic systems, required for
object search and pick-and-deliver tasks. For long-term autonomy,
robots should learn what objects look like and where they appear in
a weakly supervised manner. In this work we exploit depth
information to provide evidence about occlusion boundaries and the
scale of objects. The depth discontinuities, along with image
contours computed in the vicinity of the detection window boundary,
form an objectness measure, which is used to train an SVM
classifier. In the testing stage we exploit knowledge of the actual
size of the object to propose the scale of the detection window,
significantly pruning the number of window candidates to be
evaluated. We evaluate our approach for detecting simple objects on
the NYU RGB-D dataset, and illustrate the effectiveness of our
approach as well as difficulties with the standard evaluation
methodologies.

With the advent of RGB-D cameras in recent years, several approaches
to object detection, semantic segmentation and activity recognition,
as well as more general scene understanding, have been developed
[14, 12, 9]. The proposed approaches demonstrated improved
performance compared to purely image-based methods thanks to the
availability of depth data. Due to the range limitations of the
sensor, most of the proposed methods focus on indoor environments.
Different datasets have been proposed by researchers, which are used
for evaluation of semantic segmentation [16, 11], object detection
and categorization [12], and localization [17], respectively. These
problems and the class of environments commonly considered are
closely motivated by issues related to robot perception.

In order to enable long-term robot autonomy and facilitate more
sophisticated robotic tasks, it is important that robots can
localize objects at different scales in cluttered environments. In a
robotic setting, the capability of generating hypotheses about the
presence of objects with a particular aspect ratio and of a
particular size is of interest for tasks like object search, which
precedes closer categorization, more detailed segmentation and
manipulation. Considering this capability in the context of object
search, it is also reasonable to assume that the actual size of the
object to be located is known.


Figure 1: (a) Example scenes with small simple objects and their
bounding boxes from the NYU RGB-D V1 dataset. (b) Ground truth
labeling associated with the dataset, which typically focuses on
large regions. Many small objects are missed.

The goal of this paper is to advance the state of the art in the
detection of simple objects in cluttered RGB-D scenes. We consider
simple objects whose apparent size is possibly small and whose
bounding box approximates the extent of the object well. Some
related works approach this problem by means of semantic
segmentation of the entire image, or use models of human attention
to generate hypotheses about object location and size. We pursue the
sliding window approach to object detection and make the following
contributions: (i) we define an objectness measure computed over
windows of both images and depth maps and use it to train an SVM
classifier for scoring windows as object or background; (ii) the
classifier is trained on all bounding boxes regardless of object
size and aspect ratio; (iii) in the detection stage we sample the
actual object sizes to determine the scale of the window,
significantly pruning the number of windows which need to be
evaluated. The proposed approach is evaluated on a subset of scenes
in the NYU RGB-D V1 dataset, demonstrating the performance of the
detection compared to the ground truth labeling.

1  Related  work 

The proposed work is related to several areas of research, including
semantic segmentation, object detection and saliency detection.
While there is a large body of approaches which study these problems
in the context of images only, we focus here on the methods which
exploit depth information.

As mentioned before, the datasets used to evaluate approaches to
semantic segmentation and to object detection and segmentation
differ in their characteristics. The most important one is the scale
at which objects appear in images. A successful approach to object
detection in RGB-D data was proposed in [13], where the objects are
viewed in a table-top setting at moderate scale. The authors
formulated the object detection problem as inference on a voxel grid
reconstructed from multiple frames of RGB-D data. The final
inference is carried out in an MRF framework, where the data term
accumulates evidence from sliding-window detectors trained on
different views of the objects. A variant of the HOG descriptor [12]
was used to capture the appearance and shape information of each
view of an object, and the detectors were trained using SVMs. The
outputs of multiple HOG detectors and multiple views were then
combined to generate the score of object presence at each 3D point.
Additional features computed from the depth channel were used in the
pairwise term of the MRF model, which further improved the object
localization capability. The larger extent of the objects in the
dataset [12] and the sufficient number of training examples made the
use of a HOG detector feasible. Another related work on unsupervised
object discovery [3] has shown promising results at closer range and
with a small amount of clutter.

In the presented work we focus on the localization of simple objects
in cluttered scenes, such as the one depicted in Figure 1. Instead
of striving to achieve complete semantic segmentation of these types
of scenes as in [16, 9], we want to generate simple object
hypotheses. The notion of a simple object here is an object whose
shape can be well delimited by a bounding box. Our work is most
closely related to the work of [1], who consider the problem of
detecting generic simple objects in an unsupervised setting. The
authors of [1] compute boundaries using both RGB and depth data,
followed by a selection of salient points and boundary completion.
This method is very effective in closer-range table-top settings,
where both depth discontinuities and support surfaces can be well
estimated and the detection of image contours is more reliable.
Their method relies on a high-quality contour detector [7], which is
quite expensive to compute. While the produced contours are of high
quality, the subsequent processing steps rely on accurate depth
estimates and supporting surfaces, which are harder to attain with
varying viewpoints and at larger distances. With the change of scale
of the depth measurements, in many instances the depth measurements
are missing, and due to the common use of image inpainting
techniques the intensity and depth boundaries are not well aligned,
making contour-based segmentation techniques very brittle.

Our approach is closely related to, and motivated by, the work of
[2], who proposed a method for generic object detection in natural
images. The authors of [2] pursue a sliding window approach and
learn to distinguish generic backgrounds from object categories
using cues that characterize the length of the contour close to the
boundary of the sliding window, a saliency measure, and the
difference between color histograms outside and inside the bounding
box. The features are combined in a Bayesian framework, and a greedy
search over high-scoring windows of all aspect ratios and scales is
proposed to select the top candidates. The approach performs well on
the detection of isolated, and often a small number of, objects in
outdoor scenes (as tested on the PASCAL-VOC dataset). In indoor
settings, due to the large amount of clutter, the color contrast
feature is not as effective, and the window scoring strategy along
with the greedy approach tends to select windows of bigger size,
missing smaller objects. In our work we also use the idea of the
presence of a contour close to the object (window) boundary, but
enhance the features by also considering depth gradients, which are
indicative of occluding contours. Instead of performing a greedy
search, we use the depth information to select the scale of the
window over which the scores are computed, hence bypassing the
search over all possible aspect ratios and scales.

Another class of methods formulates the problem of object detection
using over-segmentation as the initial representation and combines
local evidence, such as shape and appearance, with pairwise
interactions between regions in an MRF framework [15, 11, 9]. The
segmentation-based approaches deal with imperfect segmentation by
generating multiple segmentations and aggregating their results to
form hypotheses about regions.

Biologically motivated approaches to object detection use various
saliency measures as a starting point, which are then enhanced using
top-down information or boosted using evidence from human attention
models. In Itti [8] the problem of salient object detection is
studied jointly with the object search problem, and a model for
combining bottom-up and top-down cues is investigated. The idea of
combining high-level concepts and low-level features to improve
current saliency models, as well as to scale up current models to
reach human performance, has been explored in [5]. More recently,
the role of depth information in bottom-up saliency models has been
studied in [6], demonstrating that the availability of depth
information affects human fixation. The authors propose to
incorporate novel depth saliency priors to augment existing
approaches which used only appearance information.

The goal of our work is to generate hypotheses about the presence of
objects in cluttered scenes. Examples of such scenes can be found in
Figure 1. While there is a large variety of objects and object
classes, we are interested in detecting smaller objects which could
possibly be manipulated. Note that the scenes have a large amount of
clutter and a large variety of objects appearing in them. For our
experiments we use the NYU RGB-D dataset [16], which has been
introduced recently in the context of semantic segmentation.


2  Approach 

In this section we describe the choice of features and the method
for scoring the candidate windows. Similarly to [2], our approach
exploits the observation that the object boundary is present in the
vicinity of the bounding box. This assumption is reasonable provided
that the objects have relatively simple shapes and the majority of
the true object boundary is close to the bounding box of the object
(Figure 2).

Features. We compute the gradient orientations in four blocks,
depicted in Figure 2c, obtained by shrinking the bounding box
boundary by a fraction $\theta$ of the size of the bounding box. The
value $\theta = 10\%$ has been used in the current experiments. We
also enlarge the windows of the ground truth bounding boxes by 5
pixels to mitigate some of the labeling errors. In each block we
quantize the orientations from 0° to 360° into 9 orientation bins.
This is done for both the intensity and the depth channel, yielding
a $2 \times 4 \times 9 = 72$-dimensional feature. Prior to histogram
computation, we normalize the gradients by the total gradient energy
in the bounding box. Average gradients of the ground truth windows
for a particular aspect ratio are visualized in Figure 3, along with
average gradients of the windows used as negative examples.
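
The feature computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four blocks are assumed to be the top, bottom, left and right strips between the bounding box and its shrunken copy, gradients are taken as central differences, and each pixel votes with its gradient magnitude. The function names and the box convention (x0, y0, x1, y1) are hypothetical.

```python
import math


def block_orientation_histograms(channel, box, shrink=0.10, n_bins=9):
    """Orientation histograms in the four border strips of `box`.

    channel: 2-D list of floats (intensity or depth), indexed [row][col].
    box:     (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    Assumption: the blocks are the top, bottom, left and right strips
    between the box and a copy shrunk by `shrink` of its width and height,
    and every pixel votes with its gradient magnitude.
    """
    x0, y0, x1, y1 = box
    dx = max(1, round(shrink * (x1 - x0)))   # strip thickness along x
    dy = max(1, round(shrink * (y1 - y0)))   # strip thickness along y
    rows, cols = len(channel), len(channel[0])
    hists = [[0.0] * n_bins for _ in range(4)]
    total = 0.0                              # gradient energy of the whole box
    for y in range(max(y0, 1), min(y1, rows - 1)):
        for x in range(max(x0, 1), min(x1, cols - 1)):
            gx = (channel[y][x + 1] - channel[y][x - 1]) / 2.0
            gy = (channel[y + 1][x] - channel[y - 1][x]) / 2.0
            mag = math.hypot(gx, gy)
            total += mag
            if y < y0 + dy:
                block = 0                    # top strip (includes corners)
            elif y >= y1 - dy:
                block = 1                    # bottom strip (includes corners)
            elif x < x0 + dx:
                block = 2                    # left strip
            elif x >= x1 - dx:
                block = 3                    # right strip
            else:
                continue                     # interior pixel: no vote
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            b = min(int(angle * n_bins / 360.0), n_bins - 1)
            hists[block][b] += mag
    if total > 0.0:                          # normalize by total box energy
        hists = [[v / total for v in h] for h in hists]
    return [v for h in hists for v in h]     # 4 * n_bins values


def objectness_descriptor(intensity, depth, box):
    """Concatenate intensity and depth histograms: 2 * 4 * 9 = 72 values."""
    return (block_orientation_histograms(intensity, box)
            + block_orientation_histograms(depth, box))
```

Because the histograms are normalized by the energy of the whole box, a window whose gradient energy is concentrated near its border yields large entries, while a window with strong interior texture yields small ones.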

Figure 2: Examples of objects and their bounding boxes, with
close-ups of the orientation energy for both the intensity and depth
channels. (a), (b) Orientation energy for a paper towel dispenser,
where the image gradients and depth gradients complement each other
well; (c) an example of an object for which strong orientation
energy in the vicinity of the boundary occurs only at a few
locations.

Figure 3: The first and second columns show average depth gradients
of positive examples from the kitchen and bathroom datasets,
respectively. The last columns show the gradients computed over
negative examples. The two rows visualize the averages for two
different aspect ratios.

Object size. We exploit the availability of depth data in order to
properly model the expected scale. The distributions of object sizes
and aspect ratios are learned from the training data, and we use
this prior knowledge to speed up window sampling in the testing
stage. In our experiments fewer than 10,000 candidate windows are
generated for the entire image at the full resolution of 480 × 640.
We first discretize the aspect ratios available in the training data
into 10 bins. For any pixel $(x, y)$ in the image, the corresponding
point $(X, Y)$ in world coordinates can be obtained as $X = xZ/f$
and $Y = yZ/f$, where $f$ is the focal length and $Z$ is the median
depth value in the bounding box. So for any two image points
$(x_1, y_1)$ and $(x_2, y_2)$ we have
$\delta x = x_1 - x_2 = \frac{f}{Z}(X_1 - X_2) = \frac{f}{Z}\,\delta X$.
This means that the scale of an object at a given distance can be
determined from its aspect ratio and depth. For each aspect ratio
bin, all possible object sizes are found by agglomerative
clustering. Examples of windows generated at some positions are
shown in Figure 4, which illustrates the effectiveness of our
approach.
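
The relation between metric object size and apparent window size can be sketched as below, assuming a pinhole camera with focal length `focal_px` expressed in pixels. The function names, the list of learned metric sizes `sizes_m`, and the clipping to the image are illustrative assumptions rather than details given in the text.

```python
def window_size_px(width_m, height_m, depth_m, focal_px):
    """Apparent size in pixels of an object of given metric size at depth_m."""
    scale = focal_px / depth_m               # pixels per metre at this depth
    return width_m * scale, height_m * scale


def propose_windows(cx, cy, depth_m, sizes_m, focal_px, img_w=640, img_h=480):
    """One window per learned object size, centred at pixel (cx, cy).

    sizes_m is a list of (width, height) pairs in metres, e.g. the cluster
    centres found for each aspect ratio bin (hypothetical input format).
    Windows are clipped to the image and returned as (x0, y0, x1, y1).
    """
    windows = []
    for width_m, height_m in sizes_m:
        w, h = window_size_px(width_m, height_m, depth_m, focal_px)
        x0, y0 = max(0, round(cx - w / 2)), max(0, round(cy - h / 2))
        x1, y1 = min(img_w, round(cx + w / 2)), min(img_h, round(cy + h / 2))
        if x1 > x0 and y1 > y0:
            windows.append((x0, y0, x1, y1))
    return windows
```

With this scheme the number of windows per location equals the number of learned sizes, independent of image resolution, which is what keeps the candidate count low compared with an exhaustive search over scales.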


Figure 4: Candidate windows at some locations. The red box is the
ground truth of an object, while the green ones are proposed by our
approach. Left: a bathroom scene; right: a kitchen scene.


3  Experimental  Setup 

We carry out our experiments on the NYU RGB-D V1 dataset [16], which
contains 7 different scene classes with a total of 64 scenes and
108,617 frames. In the reported results we focus only on the
bathroom and kitchen scenes, which contain many simple objects (e.g.
containers). After filtering out frames whose scene class has been
wrongly assigned, we obtain 70 frames from 6 bathroom scenes and 276
frames from 10 different kitchen scenes.

The NYU dataset is typically used for evaluation of approaches to
semantic segmentation. As a consequence, the labels are coarse and
inaccurate: many small objects are not labelled, in many cases the
location of the bounding boxes is not accurate, and some bounding
boxes are entirely missing (a frame with its labels is shown in
Figure 1). In addition, the number of labeled objects is very small,
which is insufficient for training. For the presented evaluation we
first filter out the non-object labels and keep the remaining
regions and their associated labels. In order to get a larger number
of training examples, we then sample around the ground truth
bounding boxes to obtain more positive training examples. The
negative examples are generated by uniformly sampling 100,000
windows over the entire image and filtering out those having a high
overlap with object windows (PASCAL score greater than 0.5). This
leaves around 20,000 to 40,000 negative examples per frame.
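
The negative sampling step can be sketched as follows. The PASCAL score is the intersection-over-union of two boxes; how the random windows are drawn (uniform corners, a hypothetical minimum side `min_size`) is an assumption made for illustration only.

```python
import random


def iou(a, b):
    """PASCAL overlap (intersection over union) of boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0                           # boxes do not overlap
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def sample_negatives(gt_boxes, n_samples, img_w=640, img_h=480,
                     max_iou=0.5, min_size=8, seed=0):
    """Draw random windows and keep those overlapping no object by > max_iou."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(n_samples):
        x0 = rng.randrange(0, img_w - min_size)
        y0 = rng.randrange(0, img_h - min_size)
        x1 = rng.randrange(x0 + min_size, img_w + 1)
        y1 = rng.randrange(y0 + min_size, img_h + 1)
        box = (x0, y0, x1, y1)
        if all(iou(box, g) <= max_iou for g in gt_boxes):
            negatives.append(box)
    return negatives
```

A call such as `sample_negatives(gt_boxes, 100000)` mirrors the 100,000 draws mentioned above; the number of windows that survive the filter depends on how the windows are sampled and on the labelled objects in the frame.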

3.1  Training  Stage 

For each setting, 2/3 of the examples are randomly selected for
training. Descriptors are computed on both positive and negative
examples. We evaluate the performance of the proposed image
descriptor, the depth descriptor, and their concatenation. For
classification we use an SVM with an intersection kernel. The
numbers of negative and positive examples are unbalanced: for
instance, there might be about 25,000 negative examples but only
about 100 positive ones in a frame. In this stage we therefore
experimented with different ratios of negative to positive examples.
As expected, the more negative examples are used, the higher the
true negative rate, while the true positive rate decreases as a
result, although not dramatically. In the testing stage we therefore
use the classifier learned with a balanced number of positive and
negative examples. In the balanced case we have 2041 positive and
2058 negative examples for the kitchen setting, and 1779 positive
and 1758 negative examples for the bathroom setting.
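
As a small illustration of the two ingredients named above, the sketch below computes the histogram intersection kernel, $K(u, v) = \sum_i \min(u_i, v_i)$, and subsamples the negatives to a chosen ratio. The helper names are hypothetical, and random subsampling is only one possible way to balance the classes. The resulting Gram matrix could be handed to any SVM solver that accepts a precomputed kernel.

```python
import random


def intersection_kernel(u, v):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(u, v))


def gram_matrix(rows, cols):
    """Kernel values between every descriptor in rows and every one in cols."""
    return [[intersection_kernel(r, c) for c in cols] for r in rows]


def balance(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives and about ratio * len(positives) random negatives."""
    rng = random.Random(seed)
    k = min(len(negatives), round(ratio * len(positives)))
    return positives, rng.sample(negatives, k)
```

Raising `ratio` above 1 reproduces the experiment described in the text, where more negatives are expected to raise the true negative rate at some cost in true positive rate.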

3.2  Testing  Stage 

Our experiments consist of two parts. First, we evaluate our
classifier on the object and non-object windows in the test data.
Then the evaluation is performed on all frames, with the windows
proposed by our algorithm.
Testing on known windows. For each frame in the test data, we
compute descriptors for both positive and negative windows, and
report the true positive rate (TPR), true negative rate (TNR),
positive predictive value (PPV) and negative predictive value (NPV)
of our classifier. The precision/recall curve on the testing data is
also reported. The results are shown in Figure 5 for bathroom and
kitchen scenes. The performance improves when depth data is combined
with the RGB image. The quantitative results are given in Table 1.
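
For reference, the four reported rates follow directly from the confusion counts. The sketch below uses hypothetical counts, not the paper's data, to show how a classifier with high TPR and TNR can still have a low PPV when negative windows greatly outnumber positive ones.

```python
def classification_rates(tp, fp, tn, fn):
    """TPR, TNR, PPV and NPV from true/false positive/negative counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),   # fraction of object windows found
        "TNR": ratio(tn, tn + fp),   # fraction of background windows rejected
        "PPV": ratio(tp, tp + fp),   # fraction of detections that are objects
        "NPV": ratio(tn, tn + fn),   # fraction of rejections that are background
    }


# Hypothetical counts: 100 object windows and 10,000 background windows.
rates = classification_rates(tp=85, fp=1200, tn=8800, fn=15)
```

Here TPR is 0.85 and TNR is 0.88, yet PPV is 85/1285, because even a modest false positive rate on a large negative set produces many more false detections than there are objects.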


Figure 5: Precision/recall curves for models trained on RGB only,
depth only and both. Left: bathroom scenes; right: kitchen scenes.


| Setting  | Model | TPR    | TNR    | PPV    | NPV    |
|----------|-------|--------|--------|--------|--------|
| bathroom | RGB   | 0.6206 | 0.8749 | 0.5708 | 0.8959 |
| bathroom | Depth | 0.6257 | 0.8321 | 0.4996 | 0.8924 |
| bathroom | RGB-D | 0.6166 | 0.9125 | 0.6537 | 0.8988 |
| kitchen  | RGB   | 0.8423 | 0.8403 | 0.2407 | 0.9888 |
| kitchen  | Depth | 0.8162 | 0.8442 | 0.2395 | 0.9871 |
| kitchen  | RGB-D | 0.856  | 0.8817 | 0.3031 | 0.9903 |

Table 1: Testing results of different training models for bathroom
and kitchen.

Testing on proposed windows. In the detection stage, traditional
approaches [2] examine all possible window aspect ratios and all
possible scales by generating increasingly complex scoring functions
and greedily selecting the candidates in the subsequent steps. In
our case we use the learned distribution of actual object sizes to
determine the apparent sizes of the windows to be scored at selected
locations. To further reduce the number of locations visited, we
first over-segment the RGB image into superpixels. We have used two
different over-segmentation strategies, [10] and [18], yielding
fewer than 1000 small superpixels. At the center of every
superpixel, windows are generated according to the learned
distribution of aspect ratios and scales. The results on 4 bathroom
scenes and 10 kitchen scenes are shown in Figure 6, where the odd
columns are ground truth and the even columns are our results. Our
approach tends to detect all the small objects in the frame,
although in some cases the detected objects are not labeled in the
ground truth; an example is shown in the right image of the last row
in Figure 6.
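
The selection of window locations can be sketched as computing one point per superpixel from a label map. The sketch assumes that the over-segmentation is available as a 2-D array of integer labels and that "center" means the centroid of the pixels of each superpixel; both are illustrative assumptions.

```python
from collections import defaultdict


def superpixel_centers(labels):
    """Centroid (x, y) of every superpixel in a 2-D integer label map."""
    sums = defaultdict(lambda: [0, 0, 0])    # label -> [sum_x, sum_y, count]
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            acc = sums[lab]
            acc[0] += x
            acc[1] += y
            acc[2] += 1
    return {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}
```

With fewer than 1000 superpixels per frame and a handful of learned sizes per location, the total number of scored windows stays well below what a dense scan over every pixel and scale would require. Note that the centroid of a strongly non-convex superpixel may fall outside it.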

To evaluate the accuracy of the proposed approach we use the PASCAL
VOC (intersection/union) score with a threshold of 0.5. Under this
criterion, 25% of the ground truth boxes are correctly detected by
our approach. Despite the apparent improvement when visually
examining the results, there are several reasons for the low value
of the score. For small objects the PASCAL criterion of 0.5 is
rather strict, and the locations of many ground truth bounding boxes
have errors large enough to push the overlap below this threshold.
Another side effect of the ground truth labeling is that many
objects are labelled as a group, and many objects which we
successfully detect are not labelled at all. At certain locations we
also suffer from misalignment of image and depth boundaries, which
is due to the inpainting algorithms used to fill in the missing
values. In the supplemental material [4] we present a comparison
with the existing approaches for object detection [2] and [1], using
the code made available by the authors. We also present a comparison
of our sliding-window methodology with bottom-up saliency-based
methods, such as the methods used in [2] to select initial windows.
Since most of these methods adopt a notion of saliency of a local
neighborhood, measured as the difference from the surroundings, the
presented examples clearly demonstrate the problems of these methods
in cluttered environments.
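
To illustrate why the 0.5 overlap criterion is strict for small objects, consider two axis-aligned square boxes of side $s$ pixels, one shifted by $d$ pixels in both $x$ and $y$ relative to the other. The numbers below are a hypothetical worked example, not measurements from the dataset.

```latex
\[
\mathrm{IoU}(s, d) = \frac{(s-d)^2}{2s^2 - (s-d)^2}, \qquad
\mathrm{IoU}(20, 5) = \frac{225}{575} \approx 0.39, \qquad
\mathrm{IoU}(100, 5) = \frac{9025}{10975} \approx 0.82 .
\]
```

The same 5-pixel localization or labeling error therefore rejects a detection of a 20-pixel object while leaving a 100-pixel object comfortably above the threshold.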

4  Conclusions 

We have presented a method for detecting simple objects in cluttered
scenes using RGB-D data. In order to overcome the brittleness of
boundary-based methods (both depth and image), we propose a
discriminative approach using intensity and depth gradient features
computed in the vicinity of the bounding box boundary, capturing the
notion of a closed boundary. We evaluate the feasibility of the
objectness measure on bounding boxes selected from the NYU RGB-D
dataset, which is typically used for evaluation of semantic
segmentation and treats many of the small objects as part of the
background. For the actual object detection stage, we presented a
method that exploits the available depth information to determine
the apparent size of objects, significantly pruning the number of
window candidates which need to be evaluated. The presented approach
shows promising results and points out many open problems with the
current evaluation pipelines and ground truth datasets. Further
improvements can be achieved by incorporating additional features
and other types of contextual information present in indoor
environments.